Below is the description and attachment. Please complete as soon as possible but at your convenience.
Using two or more algorithms, build an unsupervised time series classifier to identify characteristic day-length patterns in the data.
Use appropriate quantitative metrics to determine the number of time series clusters and to evaluate their quality.
In light of the data and the differences between algorithms, speculate on why a given method yielded quantitatively better clusters.
Code should be written in Python or R.
Interactive Notebooks (Jupyter) including code, comments, visualization, and markdown are preferred.
Submissions should be returned 2 weeks from receipt of the data challenge.
Libraries and Dataframe. Make sure to install the proper libraries and that the dataset exists in the same directory as the extracted zip file.
import sys
try:
    sys.getwindowsversion()
except AttributeError:
    isWindows = False
else:
    isWindows = True
if isWindows:
    import win32api, win32process, win32con
    pid = win32api.GetCurrentProcessId()
    handle = win32api.OpenProcess(win32con.PROCESS_ALL_ACCESS, True, pid)
    win32process.SetPriorityClass(handle, win32process.HIGH_PRIORITY_CLASS)
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering, AgglomerativeClustering
from sklearn.metrics import silhouette_samples, silhouette_score
from scipy.cluster.hierarchy import dendrogram,ward
from sklearn.cluster import DBSCAN
from kneed import KneeLocator
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D
import gc
from IPython.display import Image
%matplotlib inline
Given that the (300,000 × 14) dataset is anonymized, with 'time' as the time stamp and 'd1' to 'd13' as right-skewed numeric columns ranging from 0 to roughly 100,000, one can only make assumptions about what the dataset measures for this unsupervised time series clustering task.
Gleaning from the project description, the phrase "day-length patterns in the data" suggests that 'time hour' should be plotted against each of 'd1' to 'd13' to gain insight into the dataset before unsupervised clustering.
One can also infer that the project is View Inc-related and involves light ray measurement or electrical current activation for light-activated, self-automated window tinting, judging by the intensity of 'd1' to 'd13' over time.
Therefore, all further analysis and results from the dataset are framed under these assumptions so the report stays tangible to its application.
df = pd.read_csv(r'.\Data_Challenge_08_2018_to_05_2019.csv', engine='c', low_memory=False, memory_map=True)
df.describe()
df.info()
df.shape
To create more features from the given data, the time stamp is split into 'date' and 'time hour', where 'time hour' is converted from 'HH:MM:SS' format into decimal format on a 24-hour scale.
The conversion from datetime to numerical is necessary for various algorithms to consume the time series data without losing data integrity.
The day of the week, where Monday is 1 and Sunday is 7, is also extracted as 'weekday'. Day of the week (a 1 to 7 scale) is used instead of day of the month (a 1 to 31 scale) because day of the month is arbitrary: day 1 of the month falls on a different weekday in different months.
Also, given the assumption that the dataset measures light ray or electrical current intensity, the data are better grouped into Monday to Friday as "weekdays" and Saturday and Sunday as "weekends", assuming the building is an office where light variation scheduling (especially of exterior artificial ambient light) may differ on weekends due to internal building energy conservation or exterior light pollution.
Because day of the week (1 to 7) brings more insight, 'week number' from 1 to 52, signifying the week of the calendar year, is extracted instead of month, because month length is arbitrary, spanning from 4 weeks (e.g. February 2015) to 6 weeks (e.g. December 2018) depending on which weekday the month starts.
As a fail-safe, the ordinal Unix time of the original time stamp is also extracted, keeping month, day, and time to the exact second in a single column, in case dimensionality reduction were to remove some of the extracted time stamp columns but not others.
Non-numerical columns are then removed from the dataset once their numerical counterparts are extracted.
df[['date', 'time hour']] = df['time'].str.split(' ', n=1, expand=True)  # tuple-unpacking via .str is deprecated
df['time hour'] = df['time hour'].str.split(':').apply(lambda x: int(x[0]) + int(x[1]) / 60)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['weekday'] = df['date'].dt.dayofweek + 1
df['weekend'] = np.where(df['weekday'] > 5, 1, 0)
df['weeknumber'] = df['date'].dt.isocalendar().week.astype(int)  # Series.dt.week is deprecated
df['month']=df['date'].dt.month
df['time'] = pd.to_datetime(df['time'])
df['ordinal'] = df['time']
df['ordinal'] = df['ordinal'].apply(lambda v: v.timestamp())
df = df.drop(columns=['date','time'])
Under the assumption discussed above that d1 to d13 measure light ray or electrical current intensity, the row-mean and row-median are calculated because each row is time-sensitive. The mean summarizes the average intensity measurement of the 13 windows, and comparing the median against the mean indicates the skewness of the intensity measurements.
Adding the min, max, 25th percentile, and 75th percentile of each row may be overkill, since it could introduce too much collinearity among features.
The current state of the dataframe looks as follows:
d_columns = ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7', 'd8', 'd9', 'd10', 'd11', 'd12', 'd13']
df['d_mean'] = df[d_columns].mean(axis=1)
df['d_median'] = df[d_columns].median(axis=1)
d_columns.extend(('d_mean', 'd_median'))
df
Another dataframe is created for a grouped view of 'd1' to 'd13' intensities per 'time hour', regardless of day of week and week number.
Because the data collection has imperfect intervals and varies on the second scale, 'time hour' is rounded to the first decimal to capture intensity by the hour.
df2= df.drop(columns=['ordinal','weekend']).round(1)
df3 = df2.drop(columns=['weekday','weeknumber']).groupby(['time hour']).mean()
df3
When all the 'd1' to 'd13' intensities per 'time hour' are plotted together, the peaks and intensity ranges differ.
Under the assumed nature of the dataset, sunrise is around 6:20 AM and sunset around 8:00 PM when the mean and median intensities are plotted; both show a roughly bell-shaped distribution. This suggests that the building has windows installed on all sides; otherwise, the distribution would be skewed towards the cardinal direction with more installed windows.
The differing shapes of the intensity curves suggest that the window orientations differ in relation to how the sun travels.
df3.drop(columns=['d_mean','d_median']).plot(figsize=(15,10), title='Intensity vs Time Hour',xticks=range(0,25),ylim =(0, 42000));
df3[['d_mean','d_median']].plot(figsize=(15,10), title='Mean and Median Intensity vs Time Hour',xticks=range(0,25),ylim =(0, 42000));
timecolumns = df3.drop(columns=['d_mean','d_median', 'month']).columns
for i in timecolumns:
    df3[i].plot(title=i + ' Intensity vs Time Hour', ylim=(0, 42000), xticks=range(0, 25))
    plt.show()
Assuming that the d1 to d13 intensities come from differently oriented windows and that the building is in the northern hemisphere, the sun's intensity depends on both time of day and week of year: the sun is closer to the Tropic of Cancer in summer and further towards the Tropic of Capricorn in winter.
This means d1 to d13 intensities have a sinusoidal relationship not only with 'time hour' (peak at noon, nadir at midnight) but also with 'week of the year' (peak in June, nadir in December).
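This two-frequency relationship can be sketched with a toy model (illustrative only; the function name, peak value, and solstice weeks are assumptions, not values fitted to the data):

```python
import numpy as np

def toy_intensity(time_hour, week_of_year, peak=40000):
    """Toy daylight model: a daily sinusoid peaking at noon (clipped to zero
    at night) scaled by a yearly sinusoid peaking near week 25 (June)."""
    daily = np.sin(np.pi * (np.asarray(time_hour) - 6) / 12)  # positive ~6 AM to 6 PM
    yearly = 0.75 + 0.25 * np.cos(2 * np.pi * (week_of_year - 25) / 52)
    return peak * np.clip(daily, 0, None) * yearly

hours = np.arange(0, 24, 0.5)
june_curve = toy_intensity(hours, week_of_year=25)      # summer: tall peak
december_curve = toy_intensity(hours, week_of_year=51)  # winter: shorter peak
```

Under this sketch the noon peak in June is twice the noon peak in December, which is the qualitative shape the d-columns are expected to follow.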
Looking at the d1 to d13 intensities separately with the same y-axis range:
D5 and D6 have a high intensity peak around 9 AM that diminishes over the course of the day; they could be southeastern-facing windows.
D7, D8, and D13 show a bell-shaped curve with a high intensity peak, which could indicate southern/overhead-facing windows.
D1, D2, D3, and D12 show a bell-shaped curve with a low intensity peak, which could indicate northern-facing windows.
D4 has a peak around 9 AM with low intensity, which could be an eastern-facing window.
D9 and D10 have high intensity later in the day, which could indicate western-facing windows.
D11 has low intensity with a peak later in the day, which could be a northwestern-facing window.
Image(filename="./sun.png")
When hours of the day are grouped by day of the week, Wednesday seems to have the lowest intensity and Thursday the highest.
If the dataset actually measured light rays or electrical current, the disparity in intensity between days of the week should not be this large. It might be due to scheduled tree trimming on Wednesdays or unknown auxiliary variables that induce intensity discrepancies between days of the week.
df4 = df2.drop(columns=['weeknumber', 'month']).groupby(['time hour','weekday']).mean().reset_index().set_index('time hour')
for i in timecolumns:
    df4.groupby('weekday')[i].plot(legend=True, title=i + ' Intensity vs Time Hour grouped by Weekday', ylim=(0, 50000), xticks=range(0, 25))
    plt.show()
For the sake of plotting, month is used because there are only 12 months versus 52 weeks. The model is fed 'week of the year'; 'month' is used only for plotting purposes.
When hours of the day are grouped by month, October seems to have the lowest intensity and September the highest. December has the shortest daylight time range and August the longest (June is absent from the dataset).
If the dataset actually measured light rays or electrical current, the disparity in intensity between months should be explained by the summer and winter solstices. The solstices do explain the time range of intensity between the supposed sunrise and sunset.
Again, unknown auxiliary variables could induce intensity discrepancies between months, such as reflection of light off snow or the lack of leaves on trees, creating higher intensities during the winter months.
Also note that March's data abruptly ends at 1 PM in 2019 (even after the groupby mean), meaning the dataset is relatively new.
df5 = df2.drop(columns=['weekday']).groupby(['time hour','month']).mean().reset_index().astype(float).set_index('time hour')
for i in timecolumns:
    df5.groupby('month')[i].plot(legend=True, title=i + ' Intensity vs Time Hour grouped by Month', ylim=(0, 60000), xticks=range(0, 25))
    plt.show()
Plotting the Pearson correlation of the original dataset, we see that d1 to d13 have a high positive correlation with one another.
The extracted datetime features are simply different views of the same time stamp.
Because the relationship between time and intensity is sinusoidal rather than linear, their Pearson correlations are low.
df = df.drop(columns= 'month')
sns.set(style="whitegrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(22, 22))
hm = sns.heatmap(df.corr(method = 'pearson').round(2), annot = True, cmap = 'magma', linecolor ='white' , linewidth =0.2)
Due to time and CPU constraints, the original dataset is shuffled and subsetted to approximate the effect of clustering the whole dataset.
Many algorithms, including clustering and dimensionality reduction, do not scale well as the data grows. For the sake of time and space complexity, the data is reduced because some algorithms run in polynomial or even exponential time.
data = df.values
np.random.shuffle(data)
train, test = np.split(data, [int(.9*len(data))])
trim_data = pd.DataFrame(test)
trim_data.columns = df.columns
Due to the nature of the dataset, d1 to d13 are heavily right-skewed because half of the day has no intensity at all. Therefore, d1 to d13 are log-transformed for a better distribution. In addition, dummy variables categorize intensities below a tolerance of 20 as "zeros", marking night time when intensity is negligible.
The 'time hour' and 'weekday' distributions are even, while week of the year, month, and ordinal are sparse because the dataset does not cover the full year.
The d1 to d13 intensities have a better distribution after log transformation. As expected, intensity is split evenly between day and night.
for column in d_columns:
    trim_data[str(column) + '_log'] = np.log(trim_data[column] + 1)
    trim_data[str(column) + '_zeros'] = np.where(trim_data[column] < 20, 1, 0)
trim_data = trim_data.drop(columns=d_columns)
plt.rcParams['figure.max_open_warning']=40
colnames=list(trim_data.select_dtypes(exclude='O').columns.values)
for i in colnames:
    facet = sns.FacetGrid(trim_data, aspect=2)
    facet.map(sns.distplot, i)
    facet.add_legend()
    facet.fig.suptitle("Figure {}: {} Distribution".format(colnames.index(i) + 1, i))
    plt.show()
Plotting the Pearson correlation of the dataset after log transformation, we see that the correlations of the logged intensities and the dummy zeros move much closer towards +1 and -1, respectively. All other correlations remain about the same as before.
sns.set(style="whitegrid") # one of the many styles to plot using
f, ax = plt.subplots(figsize=(22, 22))
hm = sns.heatmap(trim_data[colnames].corr(method = 'pearson').round(2), annot = True, cmap = 'magma', linecolor ='white' , linewidth =0.2)
High-dimensional data is visualized by projecting it into a low-dimensional space, making it easier to visualize and group.
Here is a function that measures the silhouette score under cosine and euclidean distances, comparing the mean distance between a sample and all other points in its own cluster against the mean distance between that sample and all points in the nearest neighboring cluster.
A value of +1 indicates that the sample is far from its neighboring cluster and very close to the cluster it is assigned to; a value of -1 indicates the point is closer to its neighboring cluster than to its assigned cluster; a value of 0 means it lies on the boundary between the two. A value of +1 is ideal and -1 is least preferred, so the higher the value, the better the cluster configuration.
To find the optimal number of clusters, take the number of clusters with the highest silhouette score over the interval of cluster counts considered.
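As a sanity check of this interpretation, a minimal sketch on synthetic blobs (not the challenge data): well-separated, correctly labeled clusters score close to +1, while randomly permuted labels drop the score towards 0 or below.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three tight, well-separated blobs with their true labels
X, y = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)
good = silhouette_score(X, y, metric='euclidean')

# Same points, labels shuffled at random: the structure is destroyed
rng = np.random.RandomState(0)
bad = silhouette_score(X, rng.permutation(y), metric='euclidean')
print(good, bad)  # good is near +1, bad is near 0 or slightly negative
```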
def cluster_analysis(data, labels, per_cluster=True):
    metrics = ['cosine', 'euclidean']
    avg_metrics = {}
    data = pd.DataFrame(data)
    for metric in metrics:
        avg_metrics[metric] = silhouette_score(data, labels, metric=metric)
        data.loc[data.index, metric] = silhouette_samples(data, labels, metric=metric)
    avg_metrics = pd.DataFrame([avg_metrics], index=['Average for all Clusters'])
    if per_cluster:
        rows = []
        for cluster_id, group in data.groupby(labels):
            row = group[metrics].mean()
            row['cluster_id'] = int(cluster_id)
            rows.append(row)
            gc.collect()
        # DataFrame.append is deprecated; collect the rows and build the frame once
        per_cluster_metrics = pd.DataFrame(rows)
        per_cluster_metrics['cluster_id'] = per_cluster_metrics['cluster_id'].astype(int)
        per_cluster_metrics = per_cluster_metrics.set_index('cluster_id')
        ss_metrics = pd.DataFrame([(per_cluster_metrics ** 2).sum()], index=['Sum of Squares for all Clusters'])
    else:
        per_cluster_metrics = None
        ss_metrics = None
    return avg_metrics, ss_metrics, per_cluster_metrics
PCA creates low-dimensional embeddings that best preserve the overall variance of the dataset. PCA uses linear projection, and its components can be related back to the original features through a heatmap of the component loadings.
The 2D PCA plot shows that the first component is a weak mixture of all the log-transformed d1 to d13 variables and the second component is a direct representation of the ordinal datetime.
The 3D PCA plot shows that the third component is a direct representation of 'time hour'.
One drawback of PCA is that it is a linear projection, meaning it cannot capture non-linear dependencies.
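The linearity drawback can be seen on synthetic data (concentric circles, not the challenge data): PCA of centered circular data is essentially a rotation, so the radial, non-linear structure survives the projection untouched.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

# Two concentric circles: class 1 is the inner ring (factor=0.3)
X, y = make_circles(n_samples=400, factor=0.3, noise=0.02, random_state=0)
X2 = PCA(n_components=2).fit_transform(X)

# After PCA the rings are still nested: radii are preserved up to rotation
r = np.linalg.norm(X2, axis=1)
print(r[y == 1].mean(), r[y == 0].mean())  # inner mean stays well below outer mean
```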
pca = PCA(n_components=3)
trim_data_pca = pca.fit_transform(trim_data.drop(columns='ordinal'))
sns.scatterplot(trim_data_pca[:,0],trim_data_pca[:,1]);
plt.figure(figsize=(20,8))
sns.heatmap(pd.DataFrame(pca.components_,columns=trim_data.drop(columns='ordinal').columns),cmap='coolwarm');
Unfortunately, there are recording gaps in the data: certain months of the year are absent, which leads to unusual segmentation and creates unnatural clusters that would not exist had a full year of data been available. Therefore, 'week of year' as well as the ordinal datetime are removed to focus on "day-length patterns in the data".
pca = PCA(n_components=3)
trim_data_pca = pca.fit_transform(trim_data.drop(columns=['ordinal','weeknumber']))
sns.scatterplot(trim_data_pca[:,0],trim_data_pca[:,1]);
plt.figure(figsize=(20,8))
sns.heatmap(pd.DataFrame(pca.components_,columns=trim_data.drop(columns=['ordinal','weeknumber']).columns),cmap='coolwarm');
fig = pyplot.figure(figsize=(15, 10))
ax = Axes3D(fig)
ax.scatter(trim_data_pca[:,0],trim_data_pca[:,1],trim_data_pca[:,2])
ax.set_xlabel('trim_data_pca 1')
ax.set_ylabel('trim_data_pca 2')
ax.set_zlabel('trim_data_pca 3')
pyplot.show()
Euclidean distance measures how far apart two points are in n-dimensional space, and cosine similarity measures the angle between two points with the origin as the vertex.
The harmonic mean of the euclidean and cosine silhouette scores is introduced to quantify the optimal number of clusters, because one would want to optimize for both rather than for one over the other. Given that both range from -1 to 1, it is sensible to optimize them jointly.
Due to the nature of unsupervised clustering, there are no pre-existing labels to compare against once clusters are formed. Therefore, the number of clusters with the highest harmonic mean of the euclidean and cosine silhouette scores is chosen.
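The mechanics of this selection can be sketched on synthetic blobs (illustrative only, not the challenge data); the hyperparameter values here are arbitrary choices for the demo:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=1)
best_k, best_hmean = None, -1.0
for k in range(2, 7):
    labels = MiniBatchKMeans(n_clusters=k, random_state=0, n_init=3).fit_predict(X)
    s_e = silhouette_score(X, labels, metric='euclidean')
    s_c = silhouette_score(X, labels, metric='cosine')
    hmean = 2 * s_e * s_c / (s_e + s_c)  # harmonic mean of the two scores
    if hmean > best_hmean:
        best_k, best_hmean = k, hmean
print(best_k, best_hmean)
```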
best_MiniBatchKMeans_model = (None, None, -1.0)
n_clusters = np.arange(2, 11, 1)
models = []
for n in n_clusters:
    models.append((MiniBatchKMeans, {"n_clusters": n}))
for i, (Model, kwargs) in enumerate(models):
    print(Model.__name__, kwargs)
    silhouette = 0.0
    model = Model(**kwargs)
    model.fit(trim_data_pca)
    unique, counts = np.unique(model.labels_, return_counts=True)
    avg_metrics, _, _ = cluster_analysis(trim_data_pca, model.labels_, per_cluster=False)
    silhouette_euclidean = avg_metrics['euclidean'].iloc[0]
    silhouette_cosine = avg_metrics['cosine'].iloc[0]
    silhouette = (2 * silhouette_euclidean * silhouette_cosine) / (silhouette_euclidean + silhouette_cosine)  # harmonic mean
    if best_MiniBatchKMeans_model[2] < silhouette:
        best_MiniBatchKMeans_model = (model, kwargs, silhouette)
    print("Silhouette harmonic mean of euclidean and cosine: ", silhouette)
print("The best results were with the MiniBatchKMeans model: ", best_MiniBatchKMeans_model[1])
print("The results obtained: ")
print("Silhouette harmonic mean of euclidean and cosine: ", best_MiniBatchKMeans_model[2])
n_clusters=best_MiniBatchKMeans_model[1]['n_clusters']
cls = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
cls.fit(trim_data_pca)
labels = cls.labels_
centroids = cls.cluster_centers_
avg_metrics, ss_metrics, per_cluster_metrics = cluster_analysis(trim_data_pca,labels)
plt.figure(figsize=(15, 10))
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    plt.scatter(trim_data_pca[:, 0][d], trim_data_pca[:, 1][d], label=label, s=1)
for i in range(n_clusters):
    plt.text(centroids[:, 0][i], centroids[:, 1][i], i, fontsize=20)
plt.xlabel("trim_data_pca 1")
plt.ylabel("trim_data_pca 2")
plt.show()
fig = pyplot.figure(figsize=(15, 10))
ax = Axes3D(fig)
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    ax.scatter(trim_data_pca[:, 0][d], trim_data_pca[:, 1][d], trim_data_pca[:, 2][d], label=label, s=1)
for i in range(n_clusters):
    ax.text(centroids[:, 0][i], centroids[:, 1][i], centroids[:, 2][i], i, fontsize=20)
ax.set_xlabel('trim_data_pca 1')
ax.set_ylabel('trim_data_pca 2')
ax.set_zlabel('trim_data_pca 3')
pyplot.show()
display(avg_metrics)
display(per_cluster_metrics)
display(ss_metrics)
best_agglomerative_model = (None, None, -1.0)
n_clusters = np.arange(2, 11, 1)
models = []
for n in n_clusters:
    models.append((AgglomerativeClustering, {"n_clusters": n}))
for i, (Model, kwargs) in enumerate(models):
    print(Model.__name__, kwargs)
    silhouette = 0.0
    model = Model(**kwargs)
    model.fit(trim_data_pca)
    unique, counts = np.unique(model.labels_, return_counts=True)
    avg_metrics, _, _ = cluster_analysis(trim_data_pca, model.labels_, per_cluster=False)
    silhouette_euclidean = avg_metrics['euclidean'].iloc[0]
    silhouette_cosine = avg_metrics['cosine'].iloc[0]
    silhouette = (2 * silhouette_euclidean * silhouette_cosine) / (silhouette_euclidean + silhouette_cosine)  # harmonic mean
    if best_agglomerative_model[2] < silhouette:
        best_agglomerative_model = (model, kwargs, silhouette)
    print("Silhouette harmonic mean of euclidean and cosine: ", silhouette)
print("The best results were with the Agglomerative model: ", best_agglomerative_model[1])
print("The results obtained: ")
print("Silhouette harmonic mean of euclidean and cosine: ", best_agglomerative_model[2])
n_clusters=best_agglomerative_model[1]['n_clusters']
cls = AgglomerativeClustering(n_clusters=n_clusters)
cls.fit(trim_data_pca)
labels = cls.labels_
# centroids = cls.cluster_centers_
avg_metrics, ss_metrics, per_cluster_metrics = cluster_analysis(trim_data_pca,labels)
plt.figure(figsize=(15, 10))
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    plt.scatter(trim_data_pca[:, 0][d], trim_data_pca[:, 1][d], label=label, s=1)
# AgglomerativeClustering has no cluster_centers_, so no centroid labels are drawn
# for i in range(n_clusters):
#     plt.text(centroids[:, 0][i], centroids[:, 1][i], i, fontsize=20)
plt.xlabel("trim_data_pca 1")
plt.ylabel("trim_data_pca 2")
plt.show()
fig = pyplot.figure(figsize=(15, 10))
ax = Axes3D(fig)
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    ax.scatter(trim_data_pca[:, 0][d], trim_data_pca[:, 1][d], trim_data_pca[:, 2][d], label=label, s=1)
# for i in range(n_clusters):
#     ax.text(centroids[:, 0][i], centroids[:, 1][i], centroids[:, 2][i], i, fontsize=20)
ax.set_xlabel('trim_data_pca 1')
ax.set_ylabel('trim_data_pca 2')
ax.set_zlabel('trim_data_pca 3')
pyplot.show()
display(avg_metrics)
display(per_cluster_metrics)
display(ss_metrics)
best_dbscan_model = (None, None, -1.0)
eps = np.arange(1, 2.1, 0.1)
models = []
for eps_val in eps:
    models.append((DBSCAN, {"eps": eps_val, "min_samples": 30}))
for i, (Model, kwargs) in enumerate(models):
    silhouette = 0.0
    model = Model(**kwargs)
    model.fit(trim_data_pca)
    unique, counts = np.unique(model.labels_, return_counts=True)
    if len(dict(zip(unique, counts))) > 1:
        avg_metrics, _, _ = cluster_analysis(trim_data_pca, model.labels_, per_cluster=False)
        silhouette_euclidean = avg_metrics['euclidean'].iloc[0]
        silhouette_cosine = avg_metrics['cosine'].iloc[0]
        silhouette = (2 * silhouette_euclidean * silhouette_cosine) / (silhouette_euclidean + silhouette_cosine)  # harmonic mean
        if best_dbscan_model[2] < silhouette:
            best_dbscan_model = (model, kwargs, silhouette)
    print(Model.__name__, kwargs)
    print("Silhouette harmonic mean of euclidean and cosine: ", silhouette)
print("The best results were with the DBSCAN model: ", best_dbscan_model[1])
print("The results obtained: ")
print("Silhouette harmonic mean of euclidean and cosine: ", best_dbscan_model[2])
eps=best_dbscan_model[1]['eps']
min_samples=best_dbscan_model[1]['min_samples']
cls = DBSCAN(eps=eps, min_samples=min_samples)
cls.fit(trim_data_pca)
labels = cls.labels_
# centroids = cls.cluster_centers_
avg_metrics, ss_metrics, per_cluster_metrics = cluster_analysis(trim_data_pca,labels)
plt.figure(figsize=(15, 10))
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    plt.scatter(trim_data_pca[:, 0][d], trim_data_pca[:, 1][d], label=label, s=1)
plt.xlabel("trim_data_pca 1")
plt.ylabel("trim_data_pca 2")
# for i in range(n_clusters):
#     plt.text(centroids[:, 0][i], centroids[:, 1][i], i, fontsize=20)
plt.show()
fig = pyplot.figure(figsize=(15, 10))
ax = Axes3D(fig)
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    ax.scatter(trim_data_pca[:, 0][d], trim_data_pca[:, 1][d], trim_data_pca[:, 2][d], label=label, s=1)
# for i in range(n_clusters):
#     ax.text(centroids[:, 0][i], centroids[:, 1][i], centroids[:, 2][i], i, fontsize=20)
ax.set_xlabel('trim_data_pca 1')
ax.set_ylabel('trim_data_pca 2')
ax.set_zlabel('trim_data_pca 3')
pyplot.show()
display(avg_metrics)
display(per_cluster_metrics)
display(ss_metrics)
link = ward(trim_data_pca)
sns.set(style='white')
plt.figure(figsize=(20,20))
dendrogram(link)
ax = plt.gca()
bounds = ax.get_xbound()
ax.plot(bounds, [500,500],'--', c='k')
plt.show()
After the number of clusters per algorithm was chosen by the highest harmonic mean of the euclidean and cosine silhouette scores, the models were then compared to one another by the same criterion.
Depending on the run, MiniBatchKMeans yielded the highest silhouette scores, 0.937774 cosine similarity and 0.765362 euclidean distance, at 3 clusters. The 3 clusters depict midnight to sunrise, the mid-day hours, and sunset to sunrise. Had 'time hour' started at sunrise around 6:30 AM instead of 12:00 AM, the clustering might have been different, perhaps further distinguishing the time series classes.
For this run, Agglomerative clustering came in second with silhouette scores of 0.92369 cosine similarity and 0.761246 euclidean distance at 3 clusters.
DBSCAN clustering had silhouette scores of 0.915988 cosine similarity and 0.702343 euclidean distance at 4 clusters. The 4 clusters depict the sunrise/sunset transitions, noon, and the nights before and after.
Hierarchical clustering also suggests 4 clusters when the dendrogram is cut at a height of 500, although the dendrogram alone does not provide a quantitative assessment.
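The dendrogram cut can be made quantitative with scipy's `fcluster`, which assigns labels at a chosen distance threshold; the labels can then be scored with silhouette like the other algorithms. A minimal sketch on toy blobs (the 500 threshold applies to the real PCA data; the data and threshold below are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import ward, fcluster
from sklearn.datasets import make_blobs

# Four tight blobs: cutting the ward tree below the between-blob merge
# heights recovers the four underlying groups.
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 0], [0, 10], [10, 10]],
                  cluster_std=0.5, random_state=2)
link = ward(X)
labels = fcluster(link, t=15, criterion='distance')
print(len(np.unique(labels)))  # 4 clusters at this cut
```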
t-SNE uses stochastic neighbors, meaning there is no hard cutoff on which points count as neighbors of other points. This lack of fixed borders can be a major advantage because it allows t-SNE to take both global and local structure into account. Local structure is weighted more heavily than global structure, but distant points are not completely ignored, allowing for a well-balanced dimensionality reduction.
Due to the non-linearity of t-SNE, its components cannot be mapped back to the original features.
The benefit of t-SNE is that it handles the crowding problem well under aggressive dimensionality reduction. The separation is better using t-SNE as it also accounts for non-linear relationships.
gc.collect()
tsne = TSNE(n_components=3)
trim_data_tsne = tsne.fit_transform(trim_data)
sns.scatterplot(trim_data_tsne[:,0],trim_data_tsne[:,1]);
fig = pyplot.figure(figsize=(15, 10))
ax = Axes3D(fig)
ax.scatter(trim_data_tsne[:,0],trim_data_tsne[:,1],trim_data_tsne[:,2])
ax.set_xlabel('trim_data_tsne 1')
ax.set_ylabel('trim_data_tsne 2')
ax.set_zlabel('trim_data_tsne 3')
pyplot.show()
As before, the optimal number of clusters on the t-SNE embedding is chosen by the highest harmonic mean of the euclidean and cosine silhouette scores, since unsupervised clustering has no pre-existing labels to compare against.
best_MiniBatchKMeans_model = (None, None, -1.0)
n_clusters = np.arange(2, 11, 1)
models = []
for n in n_clusters:
    models.append((MiniBatchKMeans, {"n_clusters": n}))
for i, (Model, kwargs) in enumerate(models):
    print(Model.__name__, kwargs)
    silhouette = 0.0
    model = Model(**kwargs)
    model.fit(trim_data_tsne)
    unique, counts = np.unique(model.labels_, return_counts=True)
    avg_metrics, _, _ = cluster_analysis(trim_data_tsne, model.labels_, per_cluster=False)
    silhouette_euclidean = avg_metrics['euclidean'].iloc[0]
    silhouette_cosine = avg_metrics['cosine'].iloc[0]
    silhouette = (2 * silhouette_euclidean * silhouette_cosine) / (silhouette_euclidean + silhouette_cosine)  # harmonic mean
    if best_MiniBatchKMeans_model[2] < silhouette:
        best_MiniBatchKMeans_model = (model, kwargs, silhouette)
    print("Silhouette harmonic mean of euclidean and cosine: ", silhouette)
print("The best results were with the MiniBatchKMeans model: ", best_MiniBatchKMeans_model[1])
print("The results obtained: ")
print("Silhouette harmonic mean of euclidean and cosine: ", best_MiniBatchKMeans_model[2])
n_clusters=best_MiniBatchKMeans_model[1]['n_clusters']
cls = MiniBatchKMeans(n_clusters=n_clusters, random_state=0)
cls.fit(trim_data_tsne)
labels = cls.labels_
centroids = cls.cluster_centers_
avg_metrics, ss_metrics, per_cluster_metrics = cluster_analysis(trim_data_tsne,labels)
plt.figure(figsize=(15, 10))
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    plt.scatter(trim_data_tsne[:, 0][d], trim_data_tsne[:, 1][d], label=label, s=1)
plt.xlabel("trim_data_tsne 1")
plt.ylabel("trim_data_tsne 2")
for i in range(n_clusters):
    plt.text(centroids[:, 0][i], centroids[:, 1][i], i, fontsize=20)
plt.show()
fig = pyplot.figure(figsize=(15, 10))
ax = Axes3D(fig)
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    ax.scatter(trim_data_tsne[:, 0][d], trim_data_tsne[:, 1][d], trim_data_tsne[:, 2][d], label=label, s=1)
for i in range(n_clusters):
    ax.text(centroids[:, 0][i], centroids[:, 1][i], centroids[:, 2][i], i, fontsize=20)
ax.set_xlabel('trim_data_tsne 1')
ax.set_ylabel('trim_data_tsne 2')
ax.set_zlabel('trim_data_tsne 3')
pyplot.show()
display(avg_metrics)
display(per_cluster_metrics)
display(ss_metrics)
best_agglomerative_model = (None, None, -1.0)
n_clusters = np.arange(2, 11, 1)
models = []
for n in n_clusters:
    models.append((AgglomerativeClustering, {"n_clusters": n}))
for i, (Model, kwargs) in enumerate(models):
    print(Model.__name__, kwargs)
    silhouette = 0.0
    model = Model(**kwargs)
    model.fit(trim_data_tsne)
    unique, counts = np.unique(model.labels_, return_counts=True)
    avg_metrics, _, _ = cluster_analysis(trim_data_tsne, model.labels_, per_cluster=False)
    silhouette_euclidean = avg_metrics['euclidean'].iloc[0]
    silhouette_cosine = avg_metrics['cosine'].iloc[0]
    silhouette = (2 * silhouette_euclidean * silhouette_cosine) / (silhouette_euclidean + silhouette_cosine)  # harmonic mean
    if best_agglomerative_model[2] < silhouette:
        best_agglomerative_model = (model, kwargs, silhouette)
    print("Silhouette harmonic mean of euclidean and cosine: ", silhouette)
print("The best results were with the Agglomerative model: ", best_agglomerative_model[1])
print("The results obtained: ")
print("Silhouette harmonic mean of euclidean and cosine: ", best_agglomerative_model[2])
n_clusters=best_agglomerative_model[1]['n_clusters']
cls = AgglomerativeClustering(n_clusters=n_clusters)
cls.fit(trim_data_tsne)
labels = cls.labels_
# centroids = cls.cluster_centers_
avg_metrics, ss_metrics, per_cluster_metrics = cluster_analysis(trim_data_tsne,labels)
plt.figure(figsize=(15, 10))
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    plt.scatter(trim_data_tsne[:, 0][d], trim_data_tsne[:, 1][d], label=label, s=1)
plt.xlabel("trim_data_tsne 1")
plt.ylabel("trim_data_tsne 2")
# for i in range(n_clusters):
#     plt.text(centroids[:, 0][i], centroids[:, 1][i], i, fontsize=20)
plt.show()
fig = pyplot.figure(figsize=(15, 10))
ax = Axes3D(fig)
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    ax.scatter(trim_data_tsne[:, 0][d], trim_data_tsne[:, 1][d], trim_data_tsne[:, 2][d], label=label, s=1)
ax.set_xlabel('trim_data_tsne 1')
ax.set_ylabel('trim_data_tsne 2')
ax.set_zlabel('trim_data_tsne 3')
ax.legend(markerscale=10)
pyplot.show()
display(avg_metrics)
display(per_cluster_metrics)
display(ss_metrics)
best_dbscan_model = (None, None, -1.0)
eps = np.arange(1, 2.1, 0.1)
models = []
for eps_val in eps:
    models.append((DBSCAN, {"eps": eps_val, "min_samples": 30}))
for i, (Model, kwargs) in enumerate(models):
    silhouette = 0.0
    model = Model(**kwargs)
    model.fit(trim_data_tsne)
    unique, counts = np.unique(model.labels_, return_counts=True)
    if len(dict(zip(unique, counts))) > 1:  # silhouette requires at least two labels
        avg_metrics, _, _ = cluster_analysis(trim_data_tsne, model.labels_, per_cluster=False)
        silhouette_euclidean = avg_metrics['euclidean'][0]
        silhouette_cosine = avg_metrics['cosine'][0]
        # harmonic mean of the euclidean and cosine silhouette scores
        silhouette = (2 * silhouette_euclidean * silhouette_cosine) / (silhouette_euclidean + silhouette_cosine)
        if best_dbscan_model[2] < silhouette:
            best_dbscan_model = (model, kwargs, silhouette)
    print(Model.__name__, kwargs)
    print("Silhouette harmonic mean of euclidean and cosine: ", silhouette)
print("The best results were with the DBSCAN model: ", best_dbscan_model[1])
print("The results obtained: ")
print("Silhouette harmonic mean of euclidean and cosine: ", best_dbscan_model[2])
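The `eps` grid above (1.0 to 2.0) was fixed by inspection. A common alternative is the k-distance heuristic, which the notebook's `KneeLocator` import suggests was considered: sort each point's distance to its k-th nearest neighbor and take the knee of the resulting curve as `eps`. A numpy-only sketch on toy data (all names here are illustrative, not the real `trim_data_tsne`):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for trim_data_tsne.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

k = 30  # matches the min_samples used for DBSCAN above
dists, _ = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)
kth = np.sort(dists[:, -1])  # sorted distance to each point's k-th neighbor

# Crude knee detection: the point of maximum gap below the chord joining the
# curve's endpoints (KneeLocator automates this more robustly).
chord = np.linspace(kth[0], kth[-1], len(kth))
eps_guess = kth[np.argmax(chord - kth)]
print(eps_guess)
```

The knee distance is a reasonable starting `eps`; a small grid around it can then be searched with the silhouette criterion as above.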
eps = best_dbscan_model[1]['eps']
min_samples = best_dbscan_model[1]['min_samples']
cls = DBSCAN(eps=eps, min_samples=min_samples)
cls.fit(trim_data_tsne)
labels = cls.labels_
# DBSCAN has no cluster_centers_ attribute; label -1 marks noise points
avg_metrics, ss_metrics, per_cluster_metrics = cluster_analysis(trim_data_tsne, labels)
plt.figure(figsize=(15, 10))
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    plt.scatter(trim_data_tsne[:, 0][d], trim_data_tsne[:, 1][d], label=label, s=1)
plt.xlabel("trim_data_tsne 1")
plt.ylabel("trim_data_tsne 2")
plt.show()
fig = pyplot.figure(figsize=(15, 10))
ax = Axes3D(fig)
for cluster_id in np.unique(labels):
    d = np.where(cluster_id == labels)
    label = 'Cluster #' + str(cluster_id)
    ax.scatter(trim_data_tsne[:, 0][d], trim_data_tsne[:, 1][d], trim_data_tsne[:, 2][d], label=label, s=1)
ax.set_xlabel('trim_data_tsne 1')
ax.set_ylabel('trim_data_tsne 2')
ax.set_zlabel('trim_data_tsne 3')
pyplot.show()
display(avg_metrics)
display(per_cluster_metrics)
display(ss_metrics)
link = ward(trim_data_tsne)
sns.set(style='white')
plt.figure(figsize=(20, 20))
dendrogram(link)
ax = plt.gca()
bounds = ax.get_xbound()
ax.plot(bounds, [500, 500], '--', c='k')  # candidate cut height at 500
plt.show()
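The dashed line marks a cut of the dendrogram at height 500. `scipy.cluster.hierarchy.fcluster` turns such a cut into flat cluster labels; the sketch below uses toy data and a toy threshold, since the real cut at 500 applies to `trim_data_tsne`:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, ward

# Two well-separated toy blobs in place of trim_data_tsne.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(10, 1, (50, 3))])

link = ward(X)
# criterion='distance' merges everything below the threshold, mirroring the
# dashed cut line drawn in the full-data dendrogram.
labels = fcluster(link, t=20.0, criterion='distance')
print(np.unique(labels))
```

Counting the distinct labels returned by `fcluster` gives the same cluster count that was read off the dendrogram visually.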
The number of clusters for each algorithm was chosen by the highest harmonic mean of the euclidean and cosine silhouette scores, and the algorithms were then compared to one another on the same criterion. On this subset the t-SNE embedding resembles a nucleotide strand; had the full dataset been used, the clusters might have been better defined.
Depending on the run, MiniBatchKMeans yielded the highest silhouette scores (0.428545 cosine, 0.246182 euclidean) at 4 clusters; on another subset of the data the optimal count was 6.
For this run, Agglomerative Clustering came in second, with silhouette scores of 0.34196 (cosine) and 0.207246 (euclidean) at 4 clusters; on another run the optimal count was 2.
DBSCAN produced silhouette scores of 0.137749 (cosine) and 0.408141 (euclidean), but fragmented the data into roughly 700 clusters.
Cutting the Ward dendrogram at a height of 500 suggests 19 clusters, although hierarchical clustering was not assessed quantitatively here.
To interpret the dataset by cluster, bind the cluster labels to the original dataframe by row index, then group by and/or sort on the label column. The PCA-based clusters map onto the dataset visibly, and attaching the labels derived from the t-SNE embedding helps make sense of the original features as well.
PCA and t-SNE reduce dimensionality very differently, yet MiniBatchKMeans achieved the highest silhouette score on both embeddings: on the PCA projection the clusters are visually obvious, and on the t-SNE embedding the data collapses into strand-like connections for which the simplest model suffices. Although more complex algorithms may win in other circumstances, here the simplest algorithm, MiniBatchKMeans, performed best.
gc.collect()
cls = MiniBatchKMeans(n_clusters=3, random_state=0)
cls.fit(trim_data_pca)
labels = cls.labels_
resultpca = pd.concat([trim_data, pd.DataFrame(labels, columns=['label'])], axis=1)
plt.rcParams['figure.max_open_warning'] = 40
colnames = list(resultpca.drop(columns='label').select_dtypes(exclude='O').columns.values)
for i in colnames:
    facet = sns.FacetGrid(resultpca, hue='label', aspect=2)
    facet.map(sns.distplot, i)
    # facet.add_legend()
    facet.fig.suptitle("Figure {}: {} Distribution By Label".format(colnames.index(i) + 1, i))
    plt.show()
cls = MiniBatchKMeans(n_clusters=4, random_state=0)
cls.fit(trim_data_tsne)
labels = cls.labels_
resulttsne = pd.concat([trim_data, pd.DataFrame(labels, columns=['label'])], axis=1)
plt.rcParams['figure.max_open_warning'] = 40
colnames = list(resulttsne.drop(columns='label').select_dtypes(exclude='O').columns.values)
for i in colnames:
    facet = sns.FacetGrid(resulttsne, hue='label', aspect=2)
    facet.map(sns.distplot, i)
    facet.add_legend()
    facet.fig.suptitle("Figure {}: {} Distribution By Label".format(colnames.index(i) + 1, i))
    plt.show()
Above is the original dataset broken out per cluster label for both the PCA- and t-SNE-based clusterings. Labeling the original dataset shows that PCA is the more interpretable reduction here: the MiniBatchKMeans clusters on the PCA projection separate cleanly by hour of day (midnight to sunrise, sunrise to sunset, and sunset to midnight), while the t-SNE-based clusters remain mixed together.
MiniBatchKMeans after PCA did best on the 'horseshoe'-shaped projection because that shape admits a simple discretization; more complex models, which might win on a more intricate embedding, were unnecessary here.
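To quantify the hour-of-day split described above, one could aggregate the timestamp by cluster label. The frame below is a toy illustration; in the notebook the real columns would be `time` from the original dataset and the `label` column attached by `pd.concat`:

```python
import pandas as pd

# Toy frame with the same shape of information: a timestamp and a cluster label.
df = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01 02:00', '2020-01-01 08:00',
                            '2020-01-01 14:00', '2020-01-01 22:00']),
    'label': [0, 1, 1, 2],
})
df['hour'] = df['time'].dt.hour
print(df.groupby('label')['hour'].agg(['min', 'max']))
```

If the clusters really track midnight-to-sunrise, daytime, and evening, each label's hour range should occupy a distinct, contiguous band of the day.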